DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation Supplementary Material
Abstract
This supplementary material first analyzes in detail how various AFR-CNN parameters affect performance, reporting results with the PCK [13] evaluation measure. It then evaluates the proposed AFR-CNN and Dense-CNN part detection models using the strict PCP [4] measure.
Detailed AFR-CNN performance analysis (PCK). Tab. 1 provides a detailed parameter analysis of AFR-CNN, with results reported using the PCK evaluation measure. The respective parameters for each experiment are shown in the first column, and parameter differences between neighboring rows of the table are highlighted in bold. Re-scoring the 2000 DPM proposals using AFR-CNN with AlexNet [8] leads to 56.9% PCK. This is achieved using a base proposal scale of 1 (≈ head size) and training with an initial learning rate (lr) of 0.001 for 80k iterations, after which lr is reduced by a factor of 0.1, for a total of 140k SGD iterations. In addition, bounding box regression and the default IoU threshold of 0.5 for positive/negative label assignment [5] are used. Extending the regions by 4x increases the performance to 65.1% PCK, as it incorporates more context, including information about symmetric body parts, and allows the part detector to implicitly encode higher-order body part relations. No improvements are observed for larger scales. Increasing lr to 0.003, the lr reduction step to 160k and the total number of iterations to 240k improves the results to 67.4% PCK, as a higher lr allows for more significant updates of the model parameters when finetuning on the task of human body part detection. Increasing the number of training examples by reducing the training IoU threshold to 0.4 results in a slight performance improvement (68.8 vs. 67.4% PCK). Further increasing the number of training samples by horizontally flipping each image and performing translation and scale jittering of the ground truth training samples improves the performance to 69.6% PCK and 42.3% AUC.
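The training settings described above (step learning-rate schedule, IoU-based label assignment, and region extension) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the (x1, y1, x2, y2) box format, the reading of "reduced by 0.1" as multiplication by 0.1, and the interpretation of "4x" as scaling each side of the box about its center are all assumptions made for illustration.

```python
def step_lr(iteration, base_lr=0.001, step=80_000, gamma=0.1):
    """Step schedule (assumed form): multiply base_lr by gamma every `step` iterations."""
    return base_lr * gamma ** (iteration // step)


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_positive(proposal, gt_box, thresh=0.5):
    """Label a proposal as a positive training example if it overlaps the ground truth enough."""
    return iou(proposal, gt_box) >= thresh


def extend_box(box, factor=4.0):
    """Scale a box about its center (assumption: 'factor' applies to each side length)."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w = factor * (x2 - x1) / 2
    half_h = factor * (y2 - y1) / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


if __name__ == "__main__":
    gt_box = (0, 0, 10, 10)
    proposal = (0, 0, 10, 4.5)  # IoU with gt_box is 0.45
    print(step_lr(0), step_lr(80_000))
    print(is_positive(proposal, gt_box, 0.5), is_positive(proposal, gt_box, 0.4))
    print(extend_box((4, 4, 6, 6)))
```

In this toy example the proposal is rejected at the default threshold of 0.5 but accepted at 0.4, which is the mechanism by which lowering the threshold enlarges the set of positive training examples. Clipping extended boxes to the image bounds is omitted for brevity.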
The improvement from flipping and jittering is more pronounced for smaller distance thresholds (42.3 vs. 40.9% AUC): localization of body parts improves due to the increased number of jittered samples that significantly overlap with the ground truth. Further increasing the lr, the lr reduction step and the total number of iterations improves the performance to 72.4% PCK, and only very minor improvements are observed when training longer. All results above are achieved by finetuning the AlexNet architecture from the ImageNet model on the MPII training set. Further finetuning the MPII-finetuned model on the LSP training set increases the performance to 77.9% PCK, as the network learns LSP-specific image representations. Using the deeper VGG [14] architecture improves over the shallower AlexNet (77.9 vs. 72.4% PCK, 50.0 vs. 44.6% AUC). Finetuning VGG on LSP achieves a remarkable 82.8% PCK and 57.0% AUC. The strong increase in AUC (57.0 vs. 50.0%) characterizes the improvement for smaller PCK evaluation thresholds. Switching off bounding box regression results in a performance drop (81.3% PCK, 53.2% AUC), showing the importance of bounding box regression for better part localization. Overall, we demonstrate that proper adaptation and tweaking of the state-of-the-art generic object detector FR-CNN [5] leads to a strong body part detection model that dramatically improves over the vanilla FR-CNN (82.8 vs. 56.9% PCK, 57.8 vs. 35.9% AUC) and significantly outperforms the state of the art (+9.4% PCK over the best known PCK result [1] and +9.7% AUC over the best known AUC result [15]).
Overall performance using PCP evaluation measure.
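The evaluation measures referred to in this section can be sketched in a few lines of Python. This is a hedged illustration and not the evaluation code behind the reported numbers. It assumes that PCK counts a keypoint as correct when its error is at most a fraction `alpha` of a reference length, that AUC is the normalized area under the PCK-versus-threshold curve, and that strict PCP requires both limb endpoints to lie within half the ground-truth limb length. The function names, the choice of reference length, the threshold range and the trapezoidal integration are all hypothetical.

```python
import math


def pck(pred, gt, ref_len, alpha=0.2):
    """Fraction of keypoints whose error is at most alpha * ref_len."""
    hits = sum(math.dist(p, g) <= alpha * ref_len for p, g in zip(pred, gt))
    return hits / len(gt)


def pck_auc(pred, gt, ref_len, max_alpha=0.2, steps=20):
    """Normalized area under the PCK curve for thresholds in [0, max_alpha]."""
    alphas = [max_alpha * i / steps for i in range(steps + 1)]
    vals = [pck(pred, gt, ref_len, a) for a in alphas]
    # Trapezoidal rule over the sampled thresholds.
    area = sum((vals[i] + vals[i + 1]) / 2 * (alphas[i + 1] - alphas[i])
               for i in range(steps))
    return area / max_alpha


def strict_pcp(pred_limbs, gt_limbs, frac=0.5):
    """Fraction of limbs with BOTH endpoints within frac * ground-truth limb length."""
    correct = 0
    for (pa, pb), (ga, gb) in zip(pred_limbs, gt_limbs):
        limit = frac * math.dist(ga, gb)
        if math.dist(pa, ga) <= limit and math.dist(pb, gb) <= limit:
            correct += 1
    return correct / len(gt_limbs)


if __name__ == "__main__":
    gt = [(0, 0), (10, 0)]
    tight = [(0.55, 0), (10.55, 0)]  # small localization error
    loose = [(1.95, 0), (11.95, 0)]  # larger error, still inside the PCK threshold
    for name, pred in (("tight", tight), ("loose", loose)):
        print(name, pck(pred, gt, ref_len=10), pck_auc(pred, gt, ref_len=10))
```

Both toy predictions reach the same PCK at the largest threshold, but the more precisely localized one has a higher AUC. This is why a gain in AUC indicates improvement at smaller distance thresholds.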
Similar resources
Generative Partition Networks for Multi-Person Pose Estimation
This paper proposes a new Generative Partition Network (GPN) to address the challenging multi-person pose estimation problem. Different from existing models that are either completely top-down or bottom-up, the proposed GPN introduces a novel strategy—it generates partitions for multiple persons from their global joint candidates and infers instance-specific joint configurations simultaneously....
Single-Shot Multi-Person 3D Body Pose Estimation From Monocular RGB Input
We propose a new efficient single-shot method for multiperson 3D pose estimation in general scenes from a monocular RGB camera. Our fully convolutional DNN-based approach jointly infers 2D and 3D joint locations on the basis of an extended 3D location map supported by body part associations. This new formulation enables the readout of full body poses at a subset of visible joints without the ne...
Multi-person Pose Estimation with Local Joint-to-Person Associations
Despite the recent success of neural networks for human pose estimation, current approaches are limited to pose estimation of a single person and cannot handle humans in groups or crowds. In this work, we propose a method that estimates the poses of multiple persons in an image in which a person can be occluded by another person or might be truncated. To this end, we consider multiperson pos...
Harvesting Multiple Views for Marker-less 3D Human Pose Annotations Supplementary Material
In this supplementary, we provide material that could not be included in the main manuscript due to space constraints. Section 1 provides additional quantitative evaluation of our approach for multi-view pose estimation, and comparison with the state-of-the-art for HumanEva-I [4]. Section 2 provides full results of the multi-view optimization on Human3.6M after refining the generic 2D pose Conv...
Associative Embedding: End-to-End Learning for Joint Detection and Grouping
We introduce associative embedding, a novel method for supervising convolutional neural networks for the task of detection and grouping. A number of computer vision problems can be framed in this manner including multi-person pose estimation, instance segmentation, and multi-object tracking. Usually the grouping of detections is achieved with multi-stage pipelines, instead we propose an approac...